Abstract

A critical problem in the emerging high-throughput genotyping protocols
is to minimize the number of polymerase chain reaction (PCR) primers
required to amplify the single nucleotide polymorphism loci of interest.
In this paper we study PCR primer set selection with amplification
length and uniqueness constraints from both theoretical and practical
perspectives. We give a greedy algorithm that achieves a logarithmic
approximation factor for the problem of minimizing the number of primers
subject to a given upperbound on the length of PCR amplification
products. We also give, using randomized rounding, the first non-trivial
approximation algorithm for a version of the problem that requires unique
amplification of each amplification target. Empirical results on randomly
generated testcases as well as testcases extracted from the National
Center for Biotechnology Information's genomic databases show that our
algorithms are highly scalable and produce better results compared to
previous heuristics.



1 Introduction 

Availability of full genome data combined with rapid advances in high-throughput 
genomic technologies promises to revolutionize medical science by enabling large 
scale genomic analyses such as association studies between Single Nucleotide 
Polymorphisms (SNPs) and susceptibility to common diseases. Although re- 
cent work [3] suggests that there are only a few hundred thousand "blocks" of 
SNPs that recombine to provide most of the genetic variability seen in human
populations, meaningful SNP association studies will still require genotyping
many thousands of SNPs in large populations. 1 This poses a daunting chal- 
lenge to current SNP genotyping protocols (see [S] for a survey) . A critical step 
in these protocols is the cost-effective amplification of DNA sequences contain- 
ing the SNP loci of interest via biochemical reactions such as the Polymerase 
Chain Reaction (PCR). 

PCR cleverly exploits the DNA replication machinery to create up to millions 
of copies of specific DNA fragments (amplification targets). In its basic form, 
PCR requires a pair of oligonucleotides (short single-stranded DNA sequences 
called primers) for each amplification target. More precisely, the two primers 
must be (perfect or near perfect) reversed Watson-Crick complements of the 3' 
ends of the forward and reverse strands in the double-stranded amplification 
target (see Figure 1).

Typically there is significant freedom in selecting the exact ends of an am- 
plification target, i.e., in selecting PCR primers. Consequently, primer selection 
can be optimized with respect to various criteria affecting reaction efficiency, 
such as primer length, specificity, melting temperature, secondary structure, 
etc. Since the efficiency of PCR amplification falls off exponentially as the 
length of the amplification product increases, an important practical constraint 
is that the binding sites for the two primers must be within a certain maximum 
distance of each other (typically around 1000 bases). 

Much of the previous work on PCR primer selection has focused on single 
primer pair optimization with respect to the above biochemical criteria. This 
line of work has resulted in the release of several robust software tools for primer 
pair selection, the best known of which is the Primer 3 package [§]. Another op- 
timization objective studied in the literature is the minimization of the number 
of PCR primers required to carry out a given set of independent amplifications. 
Pearson et al. |Sj were the first to consider this objective in their optimal primer 
cover problem formulation: given a set of DNA sequences and an integer k, find 
the minimum number of k-mers that cover all sequences. They showed that
the primer cover problem is as hard to approximate as set cover, and hence
unlikely to be approximable within a factor better than (1 - o(1)) O(log n),
where n is the number of DNA sequences. Pearson et al. also proposed an exact
branch-and-bound algorithm for the primer cover problem and showed that the
classical greedy set cover algorithm guarantees a theoretically optimum O(log n)
approximation factor.

Multiplex PCR (MP-PCR) is a variation of PCR in which multiple DNA 
fragments are amplified simultaneously. Like the basic PCR, MP-PCR makes 
use of two oligonucleotide primers to define the boundaries of each amplification 
target. Note, however, that MP-PCR amplified targets are available only as a 
mixture and it may not be possible or cost-effective to separate them to the 
purity required, e.g., in microarray spotting. Fortunately, this is not limiting 
the applicability of MP-PCR to SNP genotyping, since most of the existing al- 
lelic discrimination methods are highly-parallel and thus can be applied directly 
to mixtures of amplified SNP loci 5 . Furthermore, effectiveness of allelic
discrimination methods is largely unaffected by the presence of a small number of
undesired amplification products, which may occur in MP-PCR.

1 For example, fully powered haplotype association studies are estimated to require as many
as 300,000 to 1,000,000 "haplotype-tag" SNPs 0.

A promising approach to further increasing MP-PCR efficiency is the use of 
degenerate PCR primers |Hj- 2 A degenerate primer is essentially a mixture con- 
sisting of multiple non-degenerate primers sharing a common pattern and can 
thus be used to simultaneously amplify many different SNP loci. For example, 
letting N denote a position in the primer sequence where all 4 nucleotides can
appear in equal proportions, the degenerate primer aNgNc represents a mixture
of 16 different non-degenerate primers {aagac, aagcc, aaggc, aagtc, ..., atgtc}.
Remarkably, degenerate primer cost is nearly identical to that of non-degenerate 
primers, since the synthesis requires the same number of steps (the only differ- 
ence is that one must add multiple nucleotides in some of the synthesis steps). 
However, since not all non-degenerate primers present in the degenerate primer 
mixture are useful, it is important to use only degenerate primers with bounded 
degeneracy. Linhart and Shamir [7] proved the NP-hardness of several formula- 
tions for the degenerate primer design problem, including a formulation which 
asks for a degenerate primer with minimum degeneracy that covers a given set 
of input strings. Souvenir et al. [11] proposed an iterative beam-search heuristic
for the related multiple degenerate primer design problem, which seeks a mini- 
mum number of degenerate primers, each with bounded degeneracy, covering a 
given set of DNA sequences. 3 
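
The aNgNc example above can be checked mechanically. The following is a minimal sketch, assuming only the symbol N described in the text (other degenerate symbols are not handled); the names `expand` and `degeneracy` are illustrative and do not come from the paper.

```python
from itertools import product

# Each symbol maps to the nucleotides it may stand for.
# Only 'N' (any of the four bases) is supported in this sketch.
DEGENERATE = {"a": "a", "c": "c", "g": "g", "t": "t", "N": "acgt"}

def expand(primer):
    """Return all non-degenerate primers represented by a degenerate primer."""
    choices = [DEGENERATE[symbol] for symbol in primer]
    return ["".join(bases) for bases in product(*choices)]

def degeneracy(primer):
    """Number of non-degenerate primers in the mixture (product of choices)."""
    result = 1
    for symbol in primer:
        result *= len(DEGENERATE[symbol])
    return result

mixture = expand("aNgNc")
print(degeneracy("aNgNc"), mixture[:4], mixture[-1])
```

A bounded-degeneracy design would simply reject any candidate whose `degeneracy` exceeds the chosen threshold.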

A common feature of the string covering formulations in [5J El 111) is that 
they decouple the selection of forward and reverse primers, and, in particular, 
cannot explicitly enforce bounds on PCR amplification length. Such bounds can 
be enforced only by conservatively defining the allowable primer binding regions 
(i.e., the DNA segments to be covered). For example, in order to guarantee a 
distance of L between the forward and reverse primer binding sites around a 
SNP, one may confine the search to primers binding within L/2 nucleotides of 
the SNP locus. However, since this constraint reduces the number of candidate 
primer pairs by a factor of about 2, 4 adopting this approach can lead to sig- 
nificant sub-optimality in the number of primers required to amplify all SNP 
loci. 

Motivated by the requirement of unique PCR amplification in synthesis of 
spotted microarrays, Fernandes and Skiena [5] introduced an elegant minimum 
multi-colored subgraph formulation for the primer selection problem. In this 
formulation, each candidate primer is represented as a graph node and every 
two primers that uniquely amplify a desired target (e.g., gene) are connected 
by an edge labeled (or "colored") by the target. The goal is to find a minimum 
subset of the nodes inducing edges of all possible colors. Fernandes and Skiena 
gave practical greedy and densest-subgraph based heuristics for the minimum
multi-colored subgraph and showed that the problem cannot be approximated
within a factor better than (1 - o(1)) ln n - o(1), where n is the number of
amplification targets. While finding a minimum primer set that amplifies a given
set of SNPs subject to amplification length constraints can be reduced to the
minimum multi-colored subgraph problem, no non-trivial approximation factor
is known for the latter problem once unique amplification is no longer required.
With unique amplification constraints, the trivial algorithm of selecting two
arbitrary primers for each of the n amplification targets gives an approximation
factor of $\sqrt{n}$.

2 Another approach is to use PCR primers that complement interspersed repetitive
sequences, such as the human Alu sequence. Since the position of the interspersed repetitive
sequences highly constrains the set of SNP loci that can be amplified, this approach is generally
not applicable when a specific set of SNPs is targeted.

3 The iterative beam-search heuristic of [11] is also applicable when a threshold is given for
the total degeneracy of the set of primers rather than individual degeneracies.

4 E.g., assuming that all DNA k-mers can be used as primers, out of the
$(L - k + 1)(L - k + 2)/2$ pairs of forward and reverse k-mers that can feasibly amplify
a SNP locus, only $(L - k + 1)^2/4$ have both k-mers within L/2 bases of this locus.

In this paper we study (degenerate and non-degenerate) PCR primer selec- 
tion problems with amplification length and uniqueness constraints from both 
theoretical and practical perspectives. Our contributions are as follows: 

• We give a new string-pair covering formulation for the minimum primer set 
selection with amplification length constraints problem, and show that a 
clever modification of the classical greedy algorithm for set cover achieves 
a near-optimal approximation factor of ln(nL), where n is the number
of amplification targets and L is the upperbound on PCR amplification
length. This result is complemented by an O(ln n) inapproximability result,
which implies that the approximation factor of the greedy algorithm is
optimal up to an additive term of O(ln L).

• We give a randomized rounding algorithm with an approximation factor
of $O(\sqrt{m} \log n)$ for the minimum multi-colored subgraph problem of [2],
where m is the maximum size of a color class (i.e., the maximum number
of edges sharing the same color) and n is the number of colors. For the
minimum primer set selection with uniqueness constraints, $m = O(L^2)$ and
the number of colors equals the number n of amplification targets. Hence,
our result implies an approximation factor of O(L log n),
which asymptotically improves over the trivial approximation bound of
$\sqrt{n}$. Furthermore, our algorithm has the same approximation guarantees
for the minimum multi-colored subgraph problem without uniqueness re- 
quirements. 

• We give the results of a comprehensive experimental study comparing our 
greedy approximation algorithm with previously published primer selec- 
tion algorithms on randomly generated testcases as well as testcases ex- 
tracted from the National Center for Biotechnology Information's genomic 
databases pQ. 

The rest of the paper is organized as follows. In the next section we introduce
notations and give formal problem definitions. In Section 3 we describe and
analyze the greedy algorithm for the minimum primer set selection with
amplification length constraints problem. In Section 4 we give the randomized
rounding algorithm for the minimum multi-colored subgraph problem. Finally,
we present experimental results in Section 5 and conclude with some open
problems in Section 6.

Figure 1: Strings $f^i$ and $r^i$ consist of the L DNA bases immediately preceding
in 3'-5' order the i-th amplification locus along the forward (respectively
reverse) genomic sequence. If forward and reverse PCR primers cover $f^i$ and
$r^i$ at positions t, respectively t', then the PCR amplification product length is
$(2L + x) - (t + t')$, where x is the length of the amplification locus (x = 1 for SNP
genotyping). Thus, amplification product length is at most L + x iff $t + t' \ge L$.



2 Notations and Problem Formulations 



Let $\Sigma = \{a, c, g, t\}$ be the DNA alphabet. We denote by $\Sigma^*$ the set of strings
over $\Sigma$, and by $\lambda$ the empty string. Overloading notations, we use $|\cdot|$ to denote
both the length of strings over $\Sigma$ and the size of sets. For a string s and an
integer $t \le |s|$, we denote by s[1..t] the prefix of length t of s.

Following [11], we define a non-degenerate primer of length k as a string from
$\Sigma^k$. A degenerate nucleotide is a non-empty subset of $\Sigma$. A degenerate primer
of length k, or simply a primer of length k, is a string $d_1 d_2 \ldots d_k$ of degenerate
nucleotides, and can equivalently be viewed as the set of non-degenerate primers
$x_1 x_2 \ldots x_k$, $x_i \in d_i$. The degeneracy of a degenerate primer $d_1 d_2 \ldots d_k$ is the
number of non-degenerate primers it represents, i.e., $\prod_{i=1}^{k} |d_i|$.

We denote by L the given threshold on the PCR amplification length, and by
$f^i$ (respectively $r^i$) the string consisting of the L DNA bases immediately
preceding in 3'-5' order the i-th amplification locus along the forward (respectively
reverse) DNA genomic sequence (see Figure 1).

We say that degenerate primer $p = d_1 d_2 \ldots d_k$ covers (or hybridizes at)
position i of string $s = s_1 s_2 \ldots s_m$ iff i is the largest index such that
$s_i s_{i+1} \ldots s_{i+k-1}$
is the reversed Watson-Crick complement of one of the non-degenerate primers
represented by p, i.e., iff $s_{i+j}$ is the Watson-Crick complement of one of the
nucleotides in $d_{k-j}$ for every $0 \le j \le k - 1$.

A set of degenerate primers P is an L-restricted primer cover for the pairs
of sequences $(f^i, r^i) \in \Sigma^L \times \Sigma^L$, i = 1, ..., n, iff for every i = 1, ..., n there
exist primers $p, p' \in P$, not necessarily distinct, and integers $t, t' \in \{1, \ldots, L\}$,
such that

1. p hybridizes at position t of $f^i$;

2. p' hybridizes at position t' of $r^i$; and

3. $t + t' \ge L$.

The last constraint ensures that the PCR amplification product length is no
more than L + x, where x is the length of the desired amplification target (x = 1
for SNP genotyping). We say that a primer cover has the unique amplification
property if, for each pair $(f^i, r^i)$, there exists exactly one set of primers
$\{p, p'\} \subseteq P$ satisfying conditions 1-3 above.

5 In practice, stable primer hybridization and subsequent PCR amplification occur even
with a small number of mismatches if none of them is too close to the 3' end of the primer.
Our algorithms apply unmodified to hybridization models allowing mismatches.
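
The definition of an L-restricted primer cover can be turned into a small checker. This is a minimal sketch under stated assumptions: primers are non-degenerate, hybridization requires an exact match, and positions are 1-based start indices of the binding site; the function names are hypothetical and not taken from the paper.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}

def reverse_complement(s):
    """Reversed Watson-Crick complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(s))

def hybridization_position(primer, s):
    """Largest 1-based index i such that s[i..i+k-1] is the reversed
    Watson-Crick complement of the primer, or 0 if there is none."""
    site = reverse_complement(primer)
    return s.rfind(site) + 1  # rfind returns -1 when absent, giving 0

def product_length(L, x, t, t_prime):
    """Amplification product length from the caption of Figure 1."""
    return (2 * L + x) - (t + t_prime)

def is_l_restricted_cover(primers, pairs, L):
    """Check conditions 1-3 for every pair (f, r) of sequences."""
    for f, r in pairs:
        # The best achievable t and t' are the largest hybridization positions.
        t = max((hybridization_position(p, f) for p in primers), default=0)
        t_prime = max((hybridization_position(p, r) for p in primers), default=0)
        if t == 0 or t_prime == 0 or t + t_prime < L:
            return False
    return True

print(is_l_restricted_cover(["ac"], [("ccgtcc", "cccgtc")], 6))
```

In the example the primer ac binds the site gt at positions 3 and 4, so t + t' = 7 and the product length is (12 + 1) - 7 = 6, which respects the bound L + x = 7.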

The minimum primer set selection problem with amplification length constraints
(MPSS-L) is defined as follows: Given primer length k, degeneracy
upperbound $\delta$, amplification length upperbound L, and n pairs of sequences
$(f^i, r^i)$, i = 1, ..., n, find a minimum size L-restricted primer cover consisting
of degenerate primers of length k, each with degeneracy at most $\delta$. The
minimum primer set selection problem with amplification length and uniqueness
constraints (MPSS-LU) is defined in the same way except that in this case we
seek a minimum size L-restricted primer cover which has the unique amplification
property.

3 The Greedy Algorithm for MPSS-L 

MPSS-L can be viewed as a generalization of the partial set cover problem [10].
In the partial set cover problem one must cover with the minimum number of
sets a given fraction of the total number of elements. In MPSS-L we can take
the elements to be covered to be the non-empty prefixes of the 2n forward and
reverse sequences; there are 2nL such elements. A primer p covers prefix $f^i[1..j]$
($r^i[1..j]$) if it hybridizes to $f^i$ (respectively $r^i$) at position $t \ge j$. The objective
is to cover at least L (i.e., half) of the elements of
$\{f^i[1..j], r^i[1..j] \mid 1 \le j \le L\}$ for every $i \in \{1, \ldots, n\}$.

For a set of primers P, let $\hat{f}^i$ and $\hat{r}^i$ denote the longest prefix of $f^i$,
respectively $r^i$, covered by a primer in P. Note that $|\hat{f}^i| + |\hat{r}^i|$ gives the
number of elements of $\{f^i[1..j], r^i[1..j] \mid 1 \le j \le L\}$ that are covered by P. Let

\[ \Phi(P) := \sum_{i=1}^{n} \min\{L,\ |\hat{f}^i| + |\hat{r}^i|\}. \]

Note that $\Phi(\emptyset) = 0$, $\Phi(P) = nL$ for every feasible
MPSS-L solution, and that $\Phi(P) \le \Phi(P')$ whenever $P \subseteq P'$. Hence, $\Phi(P)$ can
be used as a measure of the progress made towards feasibility by a set P of
primers.

The greedy algorithm (see Figure 2) starts with an empty set of primers and
iteratively selects primers which give the largest increase in $\Phi$ until reaching
feasibility.
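
The potential-driven greedy selection just described can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it assumes non-degenerate primers (degeneracy 1), exact-match hybridization, and a candidate set made of the reverse complements of all k-mers in the input sequences; all function names are hypothetical.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}

def reverse_complement(s):
    return "".join(COMPLEMENT[base] for base in reversed(s))

def position(primer, s):
    """Largest 1-based hybridization position of primer on s (0 if none)."""
    return s.rfind(reverse_complement(primer)) + 1

def potential(primers, pairs, L):
    """Sum over all pairs of min(L, longest covered prefix of f + of r)."""
    total = 0
    for f, r in pairs:
        t = max((position(p, f) for p in primers), default=0)
        t_prime = max((position(p, r) for p in primers), default=0)
        total += min(L, t + t_prime)
    return total

def greedy_cover(pairs, k, L):
    """Repeatedly add the candidate primer with the largest potential gain."""
    candidates = sorted({reverse_complement(s[i:i + k])
                         for pair in pairs for s in pair
                         for i in range(len(s) - k + 1)})
    chosen, current, target = [], 0, len(pairs) * L
    while current < target:
        best = max(candidates, key=lambda p: potential(chosen + [p], pairs, L))
        gain = potential(chosen + [best], pairs, L) - current
        if gain == 0:
            # No candidate helps, so no L-restricted cover exists.
            raise ValueError("no L-restricted cover exists for this instance")
        chosen.append(best)
        current += gain
    return chosen

pairs = [("ccgtcc", "cccgtc"), ("acagta", "cagtaa")]
print(greedy_cover(pairs, k=2, L=6))
```

In this toy instance the single primer ac binds the site gt in all four sequences and reaches the full potential nL = 12 in one step. The sketch recomputes the potential from scratch for every candidate, which is simple but far slower than an incremental implementation would be.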

Theorem 1 The greedy algorithm returns an L-restricted primer cover of size 
at most ln(nL) times larger than the optimum.



Input: Primer length k, degeneracy upperbound $\delta$, amplification length
upperbound L, and pairs of sequences $(f^i, r^i) \in \Sigma^L \times \Sigma^L$, i = 1, ..., n
Output: L-restricted primer cover P consisting of degenerate primers of length
k, each with degeneracy at most $\delta$

Figure 2: The greedy algorithm for MPSS-L



Proof. Let OPT denote a minimum size L-restricted primer cover, and let
$p_1, \ldots, p_g$ be the primers selected by the greedy algorithm. It can be verified
that, for every A and B,
$\Phi(A \cup B) \le \Phi(A) + \sum_{p \in B} [\Phi(A \cup \{p\}) - \Phi(A)]$. By using
this claim with $A = \{p_1, \ldots, p_{i-1}\}$ and B = OPT, it follows that in the step
when the greedy algorithm selects $p_i$, there is a primer in $OPT \setminus \{p_1, \ldots, p_{i-1}\}$
whose selection increases $\Phi$ by at least $(nL - \Phi(A))/|OPT|$. Hence, the selection
of $p_i$ must increase $\Phi$ by at least the same amount, i.e., reduce the difference
between $\Phi(OPT) = nL$ and the current value of $\Phi$ by a factor of at least
(1 - 1/|OPT|). By induction we get that

\[ nL - \Phi(\{p_1, \ldots, p_i\}) \le nL \left(1 - \frac{1}{|OPT|}\right)^{i} \qquad (1) \]

which implies that the number of primers selected by the greedy algorithm is
at most ln(nL) |OPT|. ■
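
The step from inequality (1) to the bound on the number of selected primers can be spelled out as follows. This is a sketch of the standard argument, assuming only that $\Phi$ is integer valued (it is a sum of string lengths) and using $1 - x < e^{-x}$ for $x > 0$; the rounding up to an integer is our addition.

```latex
\[
  nL - \Phi(\{p_1,\ldots,p_i\})
  \;\le\; nL\left(1 - \frac{1}{|OPT|}\right)^{i}
  \;<\; nL \, e^{-i/|OPT|} .
\]
% The right-hand side is at most 1 as soon as i >= |OPT| ln(nL).
% Since Phi takes integer values, nL - Phi < 1 forces Phi = nL,
% i.e. the greedy solution is feasible after at most
\[
  g \;\le\; \left\lceil\, |OPT| \,\ln(nL) \,\right\rceil
\]
% iterations.
```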



Remark. In [Sj it is proved that the following primer cover problem is as 
hard to approximate as set cover: Given integer k and strings $s_1, \ldots, s_n$, find
a minimum set of k-length primers covering all $s_i$'s. A simple approximation
preserving reduction of the primer cover problem to MPSS-L shows that the
MPSS-L problem cannot be approximated within a factor better than
$(1 - o(1)) \ln n$ unless $NP \subseteq TIME(n^{O(\log \log n)})$. Hence, the approximation factor in
Theorem 1 is tight up to an additive term of O(ln L).

4 Rounding Algorithm for the Minimum Multi- 
colored Subgraph Problem 

In this section we consider a graph-theoretical generalization of the MPSS-LU 
problem. The minimum multi-colored subgraph problem [5] is defined as follows. 
Let G = (V, E) be an undirected graph and $\chi_1, \ldots, \chi_k \subseteq E$ a family of nonempty
"color classes" of edges with the property that $\bigcup_i \chi_i = E$. Assigning
$\mathcal{X} = (\chi_1, \ldots, \chi_k)$, let $\mathcal{I}(G, \mathcal{X})$ denote the minimum size of a set of vertices I for
which the subgraph induced by these vertices contains at least one edge of each
color. Note that $2 \le \mathcal{I}(G, \mathcal{X}) \le 2|\mathcal{X}|$ and, as an edge may belong to several
distinct color classes, both of these extreme values are in fact possible.
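
For very small instances the quantity defined above can be computed by exhaustive search, which is useful for sanity-checking heuristics. The following is a brute-force sketch (exponential time, for illustration only); representing edges as vertex pairs and color classes as sets of such pairs is an assumption of this example, as is the function name.

```python
from itertools import combinations

def min_multicolored_subgraph(vertices, color_classes):
    """Smallest vertex set inducing at least one edge of every color.

    color_classes is a list of sets of edges, each edge a pair (u, v).
    Returns None if no vertex set works.
    """
    for size in range(2, len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if all(any(u in chosen and v in chosen for (u, v) in edges)
                   for edges in color_classes):
                return chosen
    return None

example_vertices = [0, 1, 2, 3]
example_colors = [
    {(0, 1)},          # first color
    {(1, 2), (2, 3)},  # second color
    {(0, 2)},          # third color
]
smallest = min_multicolored_subgraph(example_vertices, example_colors)
print(smallest)
```

Here three vertices are needed, since the first and third colors force vertices 0, 1 and 2, and these already induce the edge (1, 2) of the second color.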

The problem of computing $\mathcal{I}(G, \mathcal{X})$ is NP-hard, via, e.g., a natural reduction
from set cover. We show below that it can be approximated to within
$O(\sqrt{\max_{\chi \in \mathcal{X}} |\chi|} \, \log |\mathcal{X}|)$ in polynomial time.

Theorem 2 $\mathcal{I}(G, \mathcal{X})$ can be approximated to within $O(\sqrt{m} \log |\mathcal{X}|)$ in polynomial
time, where $m = \max_{\chi \in \mathcal{X}} |\chi|$.


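Proof. We begin with the following integer program formulation of this
optimization problem:

\[
\begin{aligned}
\min \sum_{v \in V} x_v, \quad \text{subject to} \quad
 & \forall \chi \in \mathcal{X}: \ \sum_{e \in \chi} y_e \ge 1, \\
 & \forall v \in V,\ \forall \chi \in \mathcal{X}: \ \sum_{e \in \chi,\ v \in e} y_e \le x_v, \\
 & \forall e \in E: \ y_e \ge 0, \qquad \forall v \in V: \ x_v \ge 0 .
\end{aligned}
\]

Relaxing this formulation by allowing the variables $x_v$ and $y_e$ to take values
in [0, 1] results in a linear program, the optimum value for which we denote
$\mathcal{I}_\ell(G, \mathcal{X})$. We begin by scaling the linear program to obtain the following new
linear program:

\[
\begin{aligned}
\min \sum_{v \in V} x_v, \quad \text{subject to} \quad
 & \forall \chi \in \mathcal{X}: \ \sum_{e \in \chi} y_e \ge \sqrt{m}, \\
 & \forall v \in V,\ \forall \chi \in \mathcal{X}: \ \sum_{e \in \chi,\ v \in e} y_e \le x_v, \\
 & \forall e \in E: \ y_e \ge 0, \qquad \forall v \in V: \ x_v \ge 0 .
\end{aligned}
\]

Let $\mathcal{I}_f(G, \mathcal{X})$ denote the optimum value for this scaled version, and note that
$\mathcal{I}_f(G, \mathcal{X}) \le \sqrt{m} \cdot \mathcal{I}_\ell(G, \mathcal{X})$ by scaling any solution that achieves the value
$\mathcal{I}_\ell(G, \mathcal{X})$ by the factor $\sqrt{m}$. Let $x^* \in \mathbb{R}^V$ and $y^* \in \mathbb{R}^E$ denote a feasible
solution to the program above, achieving the optimum value $\mathcal{I}_f(G, \mathcal{X})$.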

Based on the solution $(x^*, y^*)$ above, define a family of (artificial) independent
{0, 1}-valued random variables

\[ \{ Z_{v,e} \mid v \in e,\ v \in V,\ e \in E \} \]

where $\Pr[Z_{v,e} = 1] = p_e = \min(y^*_e, 1)$ for each $v \in e$. In terms of these variables,
define, for each $v \in V$ and each $(u, v) = e \in E$, the variables

\[ X_v = \bigvee_{e:\ v \in e \in E} Z_{v,e} \quad \text{and} \quad Y_e = Z_{u,e} Z_{v,e} . \]

Finally, we let the variables $X_v$ determine a random set of vertices
$S = \{ v \mid X_v = 1 \}$. Our goal is to show that, for each color class $\chi$, the set S is likely to
induce an edge in $\chi$.

Comment. Observe that the indicator variable for the event that the set S
induces the edge e = (u, v) is $X_u X_v$, which dominates the variable $Y_{(u,v)}$. We
focus on this second, less natural, set of variables because, unlike the variables
$X_u X_v$, the $Y_{(u,v)}$ are independent.

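With this in mind, note that $\Pr[Y_e = 1] = (p_e)^2$ and that for each v

\[
\Pr[v \in S] = \Pr[X_v = 1] = 1 - \prod_{e \ni v} \Pr[Z_{v,e} = 0]
 = 1 - \prod_{e \ni v} (1 - p_e)
 \le 1 - \Bigl(1 - \sum_{e \ni v} p_e\Bigr) = \sum_{e \ni v} p_e \le x^*_v .
\]

Hence, by linearity of expectation

\[
\mathrm{Exp}[|S|] = \mathrm{Exp}\Bigl[\sum_{v} X_v\Bigr]
 \le \mathcal{I}_f(G, \mathcal{X}) \le \sqrt{m} \cdot \mathcal{I}_\ell(G, \mathcal{X}) \le \sqrt{m} \cdot \mathcal{I}(G, \mathcal{X}) .
\]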



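We wish to upper bound, for each color class $\chi$, the quantity

\[ \Pr[\forall e \in \chi,\ Y_e = 0] \ \ge\ \Pr[S \text{ induces no edge from } \chi] \]

with the intention of showing that this selection S of vertices is likely to induce
many color classes. So, consider now an arbitrary color class $\chi$; then

\[
\mathrm{Exp}\Bigl[\sum_{(u,v) \in \chi} X_u X_v\Bigr] \ \ge\ \mathrm{Exp}\Bigl[\sum_{e \in \chi} Y_e\Bigr]
 = \sum_{e \in \chi} p_e^2 \ \ge\ |\chi| \cdot \Bigl(\frac{\sum_{e \in \chi} p_e}{|\chi|}\Bigr)^{2} \ \ge\ 1 ,
\]

as $\sum_{e \in \chi} p_e \ge \sqrt{m}$ and the function $x \mapsto x^2$ is convex. Considering that the $Y_e$
are independent, we compute

\[
\Pr[\chi \text{ not induced by } S] = \Pr[\forall (u, v) \in \chi,\ X_u X_v = 0]
 \ \le\ \Pr[\forall e \in \chi,\ Y_e = 0]
 = \prod_{e \in \chi} (1 - p_e^2) \ \le\ e^{-\sum_{e \in \chi} p_e^2} \ \le\ e^{-1} .
\]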



Evidently, selection of S as above "covers" any individual class $\chi$ with constant
probability. So, finally, consider the set of vertices obtained by (i.) repeating the
above procedure $t = (\log |\mathcal{X}| + 2)$ times, resulting in the vertex sets $S_1, \ldots, S_t$,
followed by (ii.) forming the union $S = \bigcup_i S_i$. Then

\[ \mathrm{Exp}[|S|] \le \sqrt{m} \, (\log |\mathcal{X}| + 2) \cdot \mathcal{I}(G, \mathcal{X}) \]

so that by Markov's inequality, the probability that |S| exceeds this value by a
factor 3 is no more than 1/3. In addition, the probability that S fails to induce
an edge in all of the color classes is

\[
\Pr[\exists \chi \in \mathcal{X},\ \text{no edge of } \chi \text{ induced by } S]
 \ \le\ |\mathcal{X}| \cdot (e^{-1})^{\log |\mathcal{X}| + 2} \ \le\ 1/3 .
\]

Hence with constant probability this procedure results in a collection of vertices
that induces at least one edge of each color class and has cardinality no more
than $O(\sqrt{m} \log |\mathcal{X}|) \, \mathcal{I}(G, \mathcal{X})$, as desired. ■
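
The rounding step of the proof can be sketched in code. This is a minimal sketch under stated assumptions: the fractional values y are taken as given (solving the scaled linear program is not shown), edges are vertex pairs, color classes are lists of edge indices, the logarithm is taken to be natural, and all function names are hypothetical.

```python
import math
import random

def round_once(edges, p, rng):
    """One pass: each endpoint of edge e is selected independently with
    probability p[e]; a vertex is kept if any incident edge selects it."""
    selected = set()
    for index, endpoints in enumerate(edges):
        for vertex in endpoints:
            if rng.random() < p[index]:
                selected.add(vertex)
    return selected

def induces_every_color(selected, edges, color_classes):
    """True if the selected vertices induce an edge of each color class."""
    return all(
        any(edges[i][0] in selected and edges[i][1] in selected for i in cls)
        for cls in color_classes
    )

def rounded_vertex_set(edges, y, color_classes, rng=None):
    """Union of ceil(log|X| + 2) independent rounding passes."""
    rng = rng or random.Random()
    p = [min(value, 1.0) for value in y]  # p_e = min(y_e, 1)
    repetitions = math.ceil(math.log(len(color_classes)) + 2)
    union = set()
    for _ in range(repetitions):
        union |= round_once(edges, p, rng)
    return union

edges = [(0, 1), (1, 2), (2, 3)]
color_classes = [[0], [1, 2]]  # indices into edges
picked = rounded_vertex_set(edges, [1.0, 1.5, 2.0], color_classes, random.Random(0))
print(picked, induces_every_color(picked, edges, color_classes))
```

Because the guarantee holds only with constant probability, a practical wrapper would call `induces_every_color` and repeat the rounding until it succeeds.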

We show below that the integrality gap of the LP defining $\mathcal{I}_\ell(G, \mathcal{X})$ is $\Omega(\sqrt{m})$
in general. This suggests that this particular LP formulation may have limited
value in achieving approximation results beyond the $\sqrt{m}$ threshold.

Theorem 3 For every s > 0 there is a pair $(G, \mathcal{X})$ for which m = s and
$\mathcal{I}(G, \mathcal{X}) \ge \Omega(\sqrt{m}) \, \mathcal{I}_\ell(G, \mathcal{X})$.

Proof. Consider the graph on $n \gg s$ vertices obtained by selecting, independently
and uniformly at random, n matchings $\chi_1, \ldots, \chi_n$ each of size s and
assigning $E = \bigcup_i \chi_i$. Observe that the feasible solution obtained by setting
$x_v = y_e = 1/s$ for all e and v implies that $\mathcal{I}_\ell(G, \mathcal{X}) \le n/s$.

On the other hand, we show that with high probability, this random selection
of matchings results in a graph for which the smallest integer solution has
objective value at least $\ell = (n - 1)/\sqrt{2s}$. Specifically, let $L \subseteq V$ be a fixed
collection of $\ell$ vertices and note that the probability that any given edge induced
by L is included in, e.g., $\chi_1$ is $s / \binom{n}{2}$; hence the probability that L induces an
edge of each color is no more than




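\[ \left( \binom{\ell}{2} \cdot \frac{s}{\binom{n}{2}} \right)^{m} \le 2^{-m} . \]

Hence the probability that some set of $\ell$ vertices induces an edge of each color
is no more than $\binom{n}{\ell} 2^{-m} < 1$ for m > n. Evidently, there exists a family of color
classes $\mathcal{X} = (\chi_1, \ldots, \chi_m)$ for which
$\mathcal{I}(G, \mathcal{X}) \ge \Omega(\sqrt{m}) \, \mathcal{I}_\ell(G, \mathcal{X})$, as desired. ■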



5 Experimental Results 

We performed experiments on both randomly generated MPSS-L instances 
and instances extracted from the human genome databases. Random DNA 
sequences were generated from the uniform distribution induced by assigning 
equal probabilities for each nucleotide. The DNA sequences consisted of regions 
surrounding 100 known SNPs collected from National Center for Biotechnology 
Information's genomic databases 0. 

For all experiments we used a bound L = 1000 on the PCR amplification 
length. In all experiments we considered only non-degenerate primers ($\delta = 1$)
with length k between 8 and 12. These values model the restricted degenerate 
primer format suggested and experimentally validated by Jordan et al. |J|. In 
this format, 8-12 nucleotides at the 3' end of each primer are fully specified, 
followed by a middle sequence of up to 6 fully degenerate nucleotides, followed 
by a fixed GC-rich sequence (CTCGAG in gj) at the 5' end. 

We compared the following four algorithms: 

• The greedy primer cover algorithm of (SJ (G-FIX). In this algorithm the 
candidate primers are collected from the reverse and forward sequences 
within a distance of L/2 around the SNP. This ensures that our final 
solution is a set of primers that meets the product length constraints. 
The algorithm repeatedly selects the candidate primer that covers the 
maximum number of not yet covered forward and reverse sequences. 

• A naive modification of G-FIX, which we call G-VAR, in which the candi- 
date primers are initially collected from the reverse and forward sequences 
within a distance of L around the SNP. The algorithm proceeds by greed- 
ily selecting primers like G-FIX, except that after a first primer p covers 
one of the forward or reverse sequences corresponding to a SNP at position 
t, we truncate the opposite sequence to a length of L — t, thus ensuring 
that the final primer cover is L-restricted. 

• The greedy approximation algorithm from Figure 2, called G-POT since
it makes greedy choices based on the "potential function" $\Phi$.

• The iterative beam-search heuristic of Souvenir et al. [11]. We used
the primer-threshold version of this heuristic, MIPS-PT, with degeneracy 
bound set to 1 and the default beam size of 100. 

Table 1 gives the number of primers selected and the running time (in CPU
seconds) for the three greedy algorithms and for the iterative beam-search
MIPS-PT heuristic of [11] on instances extracted from the NCBI repository. G-POT
has the best performance on all testcases, reducing the number of primers by
up to 24% compared to G-FIX and up to 30% compared to G-VAR. G-VAR's
performance neither dominates nor is dominated by that of G-FIX. On the other
hand, the much slower MIPS-PT heuristic has the poorest performance, possibly
because it is fine-tuned to perform well with higher degeneracy primers.

#SNPs   k    G-FIX                G-VAR                MIPS-PT              G-POT
             #Primers  CPU sec.   #Primers  CPU sec.   #Primers  CPU sec.   #Primers  CPU sec.
  50    8       13       0.13        15       0.30        21        48         10       0.32
  50   10       23       0.22        24       0.36        30       150         18       0.33
  50   12       31       0.14        32       0.30        41       246         29       0.28
 100    8       17       0.49        20       0.89        32       226         14       0.58
 100   10       37       0.37        37       0.72        50       844         31       0.75
 100   12       53       0.59        48       0.84        75      2601         42       0.61

Table 1: Results on instances extracted from NCBI repository (L = 1000).

To further characterize the performance of the compared algorithms, in Figure
3(a-c) we plot the average solution quality of the three greedy algorithms versus
the number of target SNPs (on a log scale) for randomly generated testcases.
MIPS was not included in this comparison due to its prohibitive running time.
In order to facilitate comparisons across instance sizes, the size of the primer
cover is normalized by twice the number of SNPs, which is the size of
the trivial cover obtained by using two distinct primers to amplify each SNP.
Although the improvement is highly dependent on primer length and number
of SNPs, G-POT still consistently outperforms the G-FIX algorithm of [5],
and, with few exceptions, its G-VAR modification.

Figure 3(d) gives the log-log plot of the average CPU running time (in seconds)
versus the number of pairs of sequences for primers of size 10 and randomly
generated pairs of sequences. All experiments were run on a PowerEdge 2600
Linux server with 4 GB of RAM and dual 2.8 GHz Intel Xeon CPUs (only
one of which is used by our sequential algorithms), using the same compiler
optimization options. The runtime of all three greedy algorithms grows linearly 
with the number of SNPs, with G-VAR and G-POT incurring only a small factor 
penalty in runtime compared to G-FIX. This suggests that a robust practical 
heuristic is to run all three algorithms and return the best of the three solutions 
found. 



6 Open Problems 

While the logarithmic approximation factor achieved by our greedy algorithm
for PCR primer set selection with an amplification length constraint of L is
optimal within an additive term of O(ln L), the gap between the O(ln n)
inapproximability bound established in [2] and the approximation factor of O(L ln n)
that we obtain for PCR primer set selection with uniqueness constraints is less
satisfactory. Closing this gap, either directly or via improved approximations for
the minimum multi-colored subgraph problem, is an interesting open problem.

Figure 3: (a)-(c) Performance of the compared algorithms, measured by relative
improvement over the trivial solution of using two primers per SNP for
k = 8, 10, 12, L = 1000, and up to 5000 SNPs. (d) Runtime of the compared
algorithms for k = 10, L = 1000, and up to 5000 SNPs. Each number represents
the average over 10 testcases of the respective size.